
FAQ

In this document, you can find frequently asked questions regarding your training.

Where do I find logs?

The easiest way to find logs is to open the "Jobs" tab in the menu on the left side of AML. There you can see the jobs you have executed and filter them by experiment, compute, and so on. Find the job you are looking for and open it. The FL logs are usually part of 70_driver_log.txt. Every client has its own logs, and the server has its own logs as well.

Fig. 1 - Output and logs
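If you prefer to fetch the logs programmatically rather than through the UI, a minimal sketch using the Azure ML Python SDK v2 (azure-ai-ml) could look like the one below; the subscription, resource group, workspace, and job names are placeholders you need to fill in.

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Connect to the workspace (fill in your own IDs).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Download the logs (and outputs) of a job, e.g. a client or server job of the FL run.
ml_client.jobs.download(
    name="<job-name>",
    download_path="./job_logs",
    all=True,  # include outputs in addition to the logs
)
```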

Where do I find metrics?

Navigate to the server job in Azure ML and open the "Metrics" tab (also visible in Fig. 1); there you can find the metrics for all clients. In the current setup, clients don't stream metrics to their own instance of MLflow; instead, they stream them to the server's MLflow instance.
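If you want to pull metrics programmatically instead of reading them in the UI, a minimal sketch via the workspace MLflow tracking URI could look like this; it assumes mlflow, azureml-mlflow, and azure-ai-ml are installed, and the workspace names, run ID, and metric key are placeholders.

```python
import mlflow
from mlflow.tracking import MlflowClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Point MLflow at the Azure ML workspace that hosts the server job.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
workspace = ml_client.workspaces.get("<workspace-name>")
mlflow.set_tracking_uri(workspace.mlflow_tracking_uri)

# Read the history of one metric from the server run
# (all client metrics end up in the server's MLflow run in this setup).
client = MlflowClient()
for metric in client.get_metric_history(run_id="<server-run-id>", key="<metric-name>"):
    print(metric.step, metric.value)
```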

Where do I find a model?

After the model has been trained, it should be registered in MLflow and findable via the "Models" tab. Currently there is a small bug in NVFlare where the global model doesn't get selected automatically, so we do it by hand thanks to the custom persistor. The custom persistor saves every server model locally under the /models/ directory. This should be resolved by NVFlare 2.3 in mid-April.
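To inspect what is registered programmatically, a minimal sketch could look like the one below; it assumes the MLflow tracking URI has already been set as in the metrics example above, and the registered model name is a placeholder.

```python
from mlflow.tracking import MlflowClient

# Assumes mlflow.set_tracking_uri(...) already points at the workspace,
# as in the metrics example above.
client = MlflowClient()

# List all registered versions of the (placeholder) model name
# and where their artifacts live.
for mv in client.search_model_versions("name='<registered-model-name>'"):
    print(mv.version, mv.run_id, mv.source)
```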

How to check if the job is running?

There are 2 ways:

  1. Checking the logs of the clients/server
  2. Checking it through the Runner API, in the cell under the "Check Flare job status" description in the Jupyter notebook (an alternative using the Azure ML SDK directly is sketched below the list)
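As an alternative to the notebook cell, you can also query the job status directly through the Azure ML Python SDK v2; a minimal sketch (workspace details and job name are placeholders) could be:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Print the current state of the server (or a client) job,
# e.g. "Queued", "Running", "Completed", or "Failed".
job = ml_client.jobs.get("<job-name>")
print(job.status)
```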

My job is stuck in the "queued" state

This can have several causes. Before each training run, every job stays in the Preparing/Queued state for a while. However, if the state doesn't change, there may be several reasons:

  1. If AML says it currently can't satisfy the resource request, something may be blocking the resources on the machine (an old running job?). You can double-check under Compute -> Kubernetes clusters/Attached computes -> click on the machine you want -> Jobs (or query the compute programmatically, as sketched below the list)
  2. The cluster may not be reachable at the moment
  3. The VM may not have started (or restarted) properly
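As mentioned in point 1, you can also check the compute target and its jobs programmatically. A minimal sketch (assuming an ml_client created as in the earlier examples, with the compute name as a placeholder) could be:

```python
# Assumes ml_client was created as in the earlier examples.

# Check whether the compute target itself is healthy.
compute = ml_client.compute.get("<compute-name>")
print(compute.type, compute.provisioning_state)

# Look for jobs that might still be holding resources on that compute.
for job in ml_client.jobs.list():
    if getattr(job, "compute", None) == "<compute-name>" and job.status in ("Preparing", "Queued", "Running"):
        print(job.name, job.status)
```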

Job failed with SSH error

In this case, your VM most likely did not start properly.
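If you want to restart the VM from Python rather than with the utility scripts, a minimal sketch using the Azure compute management SDK (azure-mgmt-compute) could look like this; the subscription, resource group, and VM names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute_client = ComputeManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Restart the VM hosting the client/server and wait for the operation to finish.
poller = compute_client.virtual_machines.begin_restart("<resource-group>", "<vm-name>")
poller.result()
```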

Job failed with docker error

Currently, AML doesn't clean up properly after cancelled jobs. Either terminate the leftover container by hand or use one of the utility scripts to restart the VM.
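If you want to terminate the leftover container from Python on the VM itself, a minimal sketch with the Docker SDK for Python (the docker package) could look like the one below; it assumes you can run Python on the VM with access to the Docker daemon, and the name filter is a placeholder you need to adapt to your setup.

```python
import docker

# Talk to the local Docker daemon on the VM.
client = docker.from_env()

# Find and force-remove the leftover container from the cancelled job.
# Identifying the right container (by name, image, or labels) is up to you.
for container in client.containers.list(all=True):
    if "<identifying-substring>" in container.name:
        print("removing", container.name, container.status)
        container.remove(force=True)
```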